An Improved Logging and Checkpointing Scheme for Recoverable Distributed Shared Memory
Authors
Abstract
A distributed shared memory (DSM) system turns an existing network of workstations into a powerful shared-memory parallel computer that can deliver superior price/performance. However, as more workstations join the system and execution times grow longer, the probability of faults increases, and a fault could render the system useless. Several checkpointing and logging schemes have been proposed to let a DSM system continue working after transient failures. With checkpoints, a process need not roll back to its beginning, only to its latest checkpoint. Logging is introduced to further reduce rollback propagation to other, related processes. Although logging makes rollback propagation unnecessary, it introduces overhead of its own: if every read/write operation had to be logged, the logging overhead would be prohibitive. Moreover, some previously proposed logging methods could result in incorrect recovery when processes synchronize using barriers. In this paper, we propose a novel logging scheme that greatly reduces the amount of logging by logging only the pages that are invalidated rather than all the pages accessed. The performance of our proposed scheme is analyzed using extensive simulation. Compared with two other schemes proposed earlier, our new logging scheme shows superior performance in various cases.
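The contrast between logging every access and logging only invalidated pages can be illustrated with a toy sketch. This is not the paper's protocol, which the abstract does not spell out. It assumes that "logging an invalidated page" means saving the local copy's contents at the moment the copy is discarded, and every name in it (`Node`, `access_log`, `inval_log`, `run_demo`) is hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Toy DSM node; all names are hypothetical, not taken from the paper."""
    pages: dict = field(default_factory=dict)        # page_id -> locally cached contents
    access_log: list = field(default_factory=list)   # baseline: one entry per read/write
    inval_log: list = field(default_factory=list)    # alternative: one entry per invalidation

    def read(self, page_id, fetch):
        # On a miss, fetch a copy from the (simulated) owner.
        if page_id not in self.pages:
            self.pages[page_id] = fetch(page_id)
        self.access_log.append(("read", page_id, self.pages[page_id]))
        return self.pages[page_id]

    def write(self, page_id, value):
        self.pages[page_id] = value
        self.access_log.append(("write", page_id, value))

    def invalidate(self, page_id):
        # Assumption: save the copy only when it is about to be discarded.
        if page_id in self.pages:
            self.inval_log.append((page_id, self.pages.pop(page_id)))


def run_demo():
    node = Node()
    memory = {0: "a", 1: "b"}      # stand-in for pages held by remote owners
    for _ in range(100):           # repeated accesses to the same two pages
        node.read(0, memory.get)
        node.read(1, memory.get)
    node.write(0, "a2")
    node.invalidate(1)             # simulates another node writing page 1
    return len(node.access_log), len(node.inval_log)
```

In this toy run the per-access log grows with every read and write, while the invalidation log grows only when a cached copy is discarded, which is the kind of reduction the abstract describes.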
Similar resources
An efficient causal logging scheme for recoverable distributed shared memory systems
This paper presents a causal logging scheme for the lazy release consistent distributed shared memory systems. Causal logging is a very attractive approach to provide the fault tolerance for the distributed systems, since it eliminates the need of stable logging. However, since inter-process dependency must causally be transferred with the normal messages, the excessive message overhead has bee...
Using Logging and Asynchronous Checkpointing to Implement Recoverable Distributed Shared Memory
Distributed shared memory provides a useful paradigm for developing distributed applications. As the number of processors in the system and running time of distributed applications increase, the likelihood of processor failure increases. A method of recovering processes running in a distributed shared memory environment which minimizes lost work and the cost of recovery is desirable so that lon...
Reducing Interprocessor Dependence in Recoverable Distributed Shared Memory
Checkpointing techniques in parallel systems use dependency tracking and/or message logging to ensure that a system rolls back to a consistent state. Traditional dependency tracking in distributed shared memory (DSM) systems is expensive because of high communication frequency. In this paper we show that, if designed correctly, a DSM system only needs to consider dependencies due to the transfe...
A Recoverable Distributed Shared Memory Integrating Coherence and Recoverability
Large-scale distributed systems are very attractive for the execution of parallel applications requiring a huge computing power. However, their high probability of site failure is unacceptable, especially for long time running applications. In this paper, we address this problem and propose a checkpointing mechanism relying on a recoverable distributed shared memory (DSM). Although most recover...
An Efficient Logging and Recovery Scheme for Lazy Release Consistent Distributed Shared
Checkpointing and logging are widely used techniques for providing fault tolerance in distributed systems. However, logging imposes too much overhead on processing to be a practical solution. In this paper, we propose a low-overhead logging scheme for the distributed shared memory system based on the lazy release consistency model. Unlike the previous schemes in which the logging is perf...